Abstract
Fairy tales span cultures, topics, and time periods. They have simple plots that deliver a clear message. For these reasons, fairy tales are useful for text analysis. This project uses a corpus of 61 fairy tales from Charles Perrault and the Brothers Grimm to test how written sentiment in fairy tales differs across topic, story line, and culture. Sentiment is quantified and compared using three lexicons: BING, AFINN, and NRC. The results of sentiment analysis in this project indicate that written sentiment differs more across fairy tales of different topics than it does between cultural variations of common fairy tales. Despite common sentiment between versions, sentence-level sentiment analysis across a fairy tale shows that cultural versions of a story have endings with different sentiment. This project conjectures that sentiment analysis across a story can be a useful technique for identifying critical sentences of emotion within a story.
Cinderella, Sleeping Beauty, Snow White – classic stories with a tale as old as time. The origins of these and other fairy tales begin long before the Disney versions known today. Written versions of these stories are found in the publications of famous folklorists such as Charles Perrault and the Brothers Grimm [1]. These tales both entertain readers and attract academic interest as subjects for textual analysis.
Since fairy tales are tailored to children, they are well suited for text analysis. Vaz et al. identify that, in comparison to text written for adults, fairy tales have shorter sentences, clearly defined emotions, and a plot and language easily read and understood [2]. Additionally, cultural, generational, and topic differences among fairy tale versions make it possible to research the effect of culture, time, and topic on the use of written sentiment.
Thanks to Project Gutenberg, a volunteer organization that promotes public domain ebooks, 57,000 texts are available to the public free of charge [3]. This collection includes books of fairy tales spanning multiple cultures, time periods, and topics. This project focuses primarily on cultural and topic differences. Fairy tales with cultural and topic differences were gathered from Household Fairy Tales by the Brothers Grimm and The Tales of Mother Goose by Charles Perrault. This sample allows for comparison between French and German versions of fairy tales, and covers an assortment of fairy tales with varying topics. Using a corpus built from these texts, this project aims to answer the following three research questions.
Thirteen packages are used for this project; they are necessary to follow along with the methodology and reproduce the results. The pacman package is used to facilitate installing and loading these necessary packages. The pacman package should be installed and loaded prior to proceeding.
pacman::p_load(XML,
rvest,
RCurl,
rprojroot,
tidytext,
stringr,
pdftools,
tidyr,
dplyr,
yaml,
ggplot2,
gutenbergr,
xts)
The three research questions in this project use sentiment analysis. Sentiment analysis is “the task of extracting the positive or negative orientation that a writer expresses in a text” [4]. In this project the sentiment of a text is quantified within the tidyverse ecosystem, where “text [is considered] as a combination of its individual words and the sentiment content of the whole text [is] the sum of the sentiment content of the individual words” [5]. Within the tidytext package, word sentiment can be classified using three different lexicons: AFINN, BING, or NRC. These lexicons are all limited to unigrams and use a dictionary of terms sourced and validated from modern-day English. The scoring of sentiment differs between lexicons. The AFINN lexicon scores each word with an integer between -5 (negative) and +5 (positive), the BING lexicon labels each word as either negative or positive, and the NRC lexicon further subdivides binary sentiment into the categories “positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust” [5]. The following output shows the number of words in each lexicon, the number of words per sentiment category in the BING and NRC lexicons, and the distribution of AFINN scores. The loughran lexicon also appears in the first table but is not used in this project:
# A tibble: 4 x 2
lexicon noOfWords
<chr> <int>
1 AFINN 2476
2 bing 6788
3 loughran 4149
4 nrc 13901
# A tibble: 12 x 3
# Groups: lexicon [?]
lexicon sentiment noOfWords
<chr> <chr> <int>
1 bing negative 4782
2 bing positive 2006
3 nrc anger 1247
4 nrc anticipation 839
5 nrc disgust 1058
6 nrc fear 1476
7 nrc joy 689
8 nrc negative 3324
9 nrc positive 2312
10 nrc sadness 1191
11 nrc surprise 534
12 nrc trust 1231
# A tibble: 11 x 3
# Groups: lexicon [1]
lexicon `cut_width(score, 1)` n
<chr> <fct> <int>
1 AFINN [-5.5,-4.5] 16
2 AFINN (-4.5,-3.5] 43
3 AFINN (-3.5,-2.5] 264
4 AFINN (-2.5,-1.5] 965
5 AFINN (-1.5,-0.5] 309
6 AFINN (-0.5,0.5] 1
7 AFINN (0.5,1.5] 208
8 AFINN (1.5,2.5] 448
9 AFINN (2.5,3.5] 172
10 AFINN (3.5,4.5] 45
11 AFINN (4.5,5.5] 5
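As a minimal sketch of the additive scoring described above, the base R snippet below scores one sentence against two toy lexicons. The vectors `toy_afinn` and `toy_bing` and the helper `score_sentence` are hypothetical stand-ins invented for illustration; their values are not taken from the real AFINN or BING word lists.

```r
# Toy lexicons (hypothetical values, for illustration only)
toy_afinn <- c(happy = 3, wicked = -3, dead = -3, love = 3)   # graded integer scores
toy_bing  <- c(happy = 1, wicked = -1, dead = -1, love = 1)   # +1 positive, -1 negative

# Sum the scores of the words found in the lexicon; unknown words contribute 0
score_sentence <- function(sentence, lexicon) {
  words <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  words <- words[words != ""]
  sum(lexicon[words], na.rm = TRUE)
}

sentence <- "The wicked queen was dead and the prince was happy"
afinn_score <- score_sentence(sentence, toy_afinn)
bing_score  <- score_sentence(sentence, toy_bing)
```

The same sentence receives a different total under each scheme because the graded lexicon weights each matched word more heavily than the binary one.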
Since the number of sentiment-associated words and the quantification of word sentiment differ by lexicon, this project compares sentiment analysis using the three available lexicons to see which works best for comparing sentiment in fairy tales. Additionally, since the lexicons consist of modern-day English terms, it is possible that the sentiment of some words in 18th and 19th century folktales is scored inconsistently or is not included in the lexicons used for this project. While this concern is noted, this project proceeds with the three available lexicons and recommends future incorporation of a subject-specific sentiment lexicon when one is developed and made compatible with the tidytext package in R.
The first research question concerns identifying whether fairy tales of varying topics focus on different emotions. This requires sorting fairy tales into categories. This project uses the Aarne-Thompson-Uther classification system to distinguish between classes of fairy tales. The topic categories and corresponding indices within each topic are provided in the following table.
\[\begin{array}{ll} \mbox{Topic} & \mbox{Index Range} \\ \mbox{Animal Tales} & \mbox{1-299} \\ \mbox{Tales of Magic} & \mbox{300-749} \\ \mbox{Religious Tales} & \mbox{750-849} \\ \mbox{Realistic Tales} & \mbox{850-899} \\ \mbox{Tales of the Stupid Ogre} & \mbox{1000-1199} \\ \mbox{Anecdotes and Jokes} & \mbox{1200-1999} \\ \mbox{Formula Tales} & \mbox{2000-2399} \end{array}\]
Each fairy tale was manually indexed based on the classifications provided by [6] and [7]. Of the 61 fairy tales in the collection, 3 titles were not assigned to a category. For this project, undesignated fairy tales are excluded when comparing sentiment between topics. The following code identifies the fairy tales within each topic.
AnimalTales <- c(" THE WOLF AND THE SEVEN LITTLE GOATS.",
" THE WONDERFUL MUSICIAN",
" THE STRAW, THE COAL, AND THE BEAN",
" THE MOUSE, THE BIRD, AND THE SAUSAGE",
" THE BREMEN TOWN MUSICIANS",
" HOW MRS FOX MARRIED AGAIN FIRST VERSION",
" HOW MRS FOX MARRIED AGAIN SECOND VERSION",
" MR KORBES",
" OLD SULTAN",
" THE DOG AND THE SPARROW"
)
TalesOfMagic <- c("CINDERELLA, OR THE LITTLE GLASS SLIPPER.",
" THE SLEEPING BEAUTY IN THE WOODS.",
" LITTLE THUMB.",
" THE MASTER CAT, OR PUSS IN BOOTS.",
" RIQUET WITH THE TUFT.",
" BLUE BEARD.",
" THE FAIRY.",
" LITTLE RED RIDING-HOOD.",
" SIX SOLDIERS OF FORTUNE",
" THE GOOSE GIRL.",
" THE RAVEN",
" THE FROG PRINCE",
" FAITHFUL JOHN",
" THE TWELVE BROTHERS",
" THE BROTHER AND SISTER",
" RAPUNZEL",
" THE THREE LITTLE MEN IN THE WOOD",
" THE THREE SPINSTERS",
" HANSEL AND GRETHEL",
" THE WHITE SNAKE",
" THE FISHERMAN AND HIS WIFE",
" ASCHENPUTTEL",
" MOTHER HULDA",
" LITTLE RED CAP",
" THE TABLE, THE ASS, AND THE STICK.",
" TOM THUMB",
" THE ELVES",
" TOM THUMB'S TRAVELS",
" THE ALMOND TREE",
" THE SIX SWANS",
" THE SLEEPING BEAUTY",
" SNOW-WHITE",
" THE KNAPSACK, THE HAT, AND THE HORN",
" RUMPELSTILTSKIN",
" THE GOLDEN BIRD",
" THE QUEEN BEE",
" THE GOLDEN GOOSE"
)
TalesOfTheStupidOgre <- c(" ROLAND")
RealisticTales <- c(" THE ROBBER BRIDEGROOM",
" KING THRUSHBEARD"
)
AnecdotesAndJokes <- c(" CLEVER GRETHEL",
" HANS IN LUCK",
" THE GALLANT TAILOR",
" CLEVER ELSE",
" HOW MRS FOX MARRIED AGAIN FIRST VERSION",
" HOW MRS FOX MARRIED AGAIN SECOND VERSION",
" FRED AND KATE",
" THE LITTLE FARMER"
)
FormulaTales <- c(" THE DEATH OF THE HEN")
NoCategoryTales <- c("THE RABBIT'S BRIDE",
" THE VAGABONDS",
" PRUDENT HANS"
)
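The index ranges in the topic table can also be applied programmatically. The base R sketch below maps a numeric Aarne-Thompson-Uther index to its topic. The names `atu_ranges` and `classify_atu` are hypothetical and not part of the original project, which assigned titles by hand; the ranges are copied from the table, so an index that falls in a gap between ranges returns `NA`.

```r
# Ranges and labels copied from the topic table
atu_ranges <- data.frame(
  topic = c("Animal Tales", "Tales of Magic", "Religious Tales", "Realistic Tales",
            "Tales of the Stupid Ogre", "Anecdotes and Jokes", "Formula Tales"),
  lower = c(1, 300, 750, 850, 1000, 1200, 2000),
  upper = c(299, 749, 849, 899, 1199, 1999, 2399),
  stringsAsFactors = FALSE
)

# Return the topic whose range contains each index, or NA if no range does
classify_atu <- function(index) {
  vapply(index, function(i) {
    hit <- which(i >= atu_ranges$lower & i <= atu_ranges$upper)
    if (length(hit) == 1) atu_ranges$topic[hit] else NA_character_
  }, character(1))
}

topics <- classify_atu(c(130, 333, 950))
```

A lookup table keeps the ranges in one place, so a change to the classification scheme would only require editing `atu_ranges`.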
The second research question concerns identifying whether written sentiment in fairy tales varies throughout the course of a story. The motivation for this question is based on Vladimir Propp’s theory of folk tale morphology [8]. Propp, a Russian folklorist, identified 31 narrative units common across folk tales. These 31 narrative chunks were grouped into four spheres: the introduction, the body of the story, the donor sequence, and the hero’s return. While not every story contains all 31 narrative chunks, and the sequence of these chunks may be inconsistent between stories, Propp’s structure motivates asking whether written sentiment differs between narrative units and, if so, whether a common distribution of sentiment throughout fairy tales emerges.
Unfortunately, the corpus of text for this project is not tagged for Propp’s 31 narrative units, and it is beyond the scope of this project to attempt to create these tags. Therefore, to show the distribution of sentiment throughout each fairy tale, the sentiment is computed for each sentence. Sentence sentiment is quantified by summing word sentiment within a given sentence and then normalizing the score to account for sentence length. The distribution of sentiment in a story can then be visualized as a time series of normalized sentence sentiment scores.
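The sentence-level scoring described above can be sketched in base R as follows. The text does not state the exact normalization, so dividing the summed score by the number of words in the sentence is an assumption, and the data frame `words` and the vector `sentence_length` are hypothetical toy inputs.

```r
# Toy word-level data: one row per scored word (hypothetical values)
words <- data.frame(
  s_index   = c(1, 1, 1, 2, 2, 3),      # sentence each scored word belongs to
  sentiment = c(1, -1, 1, -1, -1, 1)    # +1 positive, -1 negative
)

# Hypothetical total word count of each sentence, including unscored words
sentence_length <- c(`1` = 10, `2` = 4, `3` = 5)

# Sum word sentiment within each sentence, then divide by sentence length
raw_score  <- tapply(words$sentiment, words$s_index, sum)
norm_score <- raw_score / sentence_length[names(raw_score)]
```

Plotting `norm_score` against the sentence index gives the time-series view of a story, with long and short sentences placed on a comparable scale.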
The final research question concerns identifying whether written sentiment varies between cultural versions of fairy tales. As a case study, versions of Cinderella, Sleeping Beauty, and Little Red Riding Hood are compared between French and German authors, Perrault and the Brothers Grimm.
The gutenbergr package by David Robinson provides access to the Project Gutenberg collection from within R. In this project, the works of the Brothers Grimm and Charles Perrault are accessed with the function gutenberg_download(), which downloads each eBook by its Project Gutenberg ID.
grimm <- gutenberg_download(19068)
Perrault <- gutenberg_download(17208)
After downloading the raw data, irrelevant text that is not part of the fairy tales must be removed. This includes removing introductory text at the start of the book, concluding text at the end of the book, and illustration placeholders. Additionally, it is necessary to format the text so that it can be easily parsed by fairy tale title.
grimm <- grimm %>%
mutate(gutenberg_id = "Brothers Grimm")%>%
filter(row_number()>238)%>% #Remove text at beginning of book
filter(row_number()<10806)%>% #Remove text at end of book
filter(!str_detect(text, regex("^ \\[Illustration")))%>% #Remove Illustration Placeholders
filter(!str_detect(text, regex("^\\[Illust"))) #Remove Illustration Placeholders
replace_grimm6569 ="HOW MRS FOX MARRIED AGAIN FIRST VERSION"
replace_grimm6637 ="HOW MRS FOX MARRIED AGAIN SECOND VERSION"
grimm$text <- c(grimm$text[1:382],toupper(grimm$text[383]),
grimm$text[384:705],tolower(grimm$text[706:709]),grimm$text[710:1451],
toupper(grimm$text[1452]), grimm$text[1453:1576], tolower(grimm$text[1577:1583]), grimm$text[1584:2707],
tolower(grimm$text[2708:2709]),grimm$text[2710:2839], toupper(grimm$text[2840]),
grimm$text[2841:3744],toupper(grimm$text[3745]),
grimm$text[3746:3803],toupper(grimm$text[3804]),
grimm$text[3805:4868],toupper(grimm$text[4869]),
grimm$text[4870:4955],tolower(grimm$text[4956:4961]), grimm$text[4962:5844],
toupper(grimm$text[5845]), grimm$text[5846:6568],replace_grimm6569,
grimm$text[6570:6636], replace_grimm6637, grimm$text[6638:6834],
tolower(grimm$text[6835:6837]),
grimm$text[6838:7242], tolower(grimm$text[7243:7244]), grimm$text[7245:7768],
tolower(grimm$text[7769:7773]),grimm$text[7774:8277], tolower(grimm$text[8278:8281]),
grimm$text[8282:8677],toupper(grimm$text[8678]),
grimm$text[8679:9241],tolower(grimm$text[9242:9244]),grimm$text[9245:9535],
toupper(grimm$text[9536]),grimm$text[9537:9695],toupper(grimm$text[9696]),
grimm$text[9697:10589])
Perrault <- Perrault %>%
mutate(gutenberg_id = "Perrault")%>%
filter(row_number()>139)%>% #Remove text at beginning of book
filter(row_number()<1867)%>% #Remove text at end of book
filter(!str_detect(text, regex("^ \\[Illustration")))%>% #Remove Illustration Placeholders
filter(!str_detect(text, regex("^\\[Illust"))) #Remove Illustration Placeholders
After cleaning the data, the raw text of each eBook is formatted as a data frame in which each observation is a line of text from the eBook. In total, 61 fairy tales comprise the corpus for this project. However, in its original format, the text of each book is not separated by story. Nevertheless, each story is preceded by a fully capitalized title, as shown below for the first fairy tale in Household Fairy Tales by the Brothers Grimm. Fairy tale titles are similarly capitalized in the text for Charles Perrault.
grimm[1:6,2]
# A tibble: 6 x 1
text
<chr>
1 THE RABBIT'S BRIDE
2 ""
3 ""
4 THERE was once a woman who lived with her daughter in a beautiful
5 cabbage-garden; and there came a rabbit and ate up all the cabbages. At
6 last said the woman to her daughter,
Using a regular expression, the start of each story is identified, and each line of text is thereby assigned to its corresponding story. In this step the cleaned text is also transformed into a tidy data structure by making each word an observation. The sentence and paragraph in which each word appears are stored as additional variables in the dataset. This preserves the structure of the original text at varying granularity for the second research question, which quantifies the written sentiment across the duration of each story. Additionally, each story is indexed according to the Aarne-Thompson-Uther classification system used for the first research question. Finally, the corpus for this project is completed by combining the fairy tales of the Brothers Grimm and Charles Perrault into a single data frame. The following code executes these steps.
FairyTaleCorpus <- bind_rows(grimm,Perrault) #Combine fairy tales into a single corpus
FairyTaleCorpus <- FairyTaleCorpus[-c(47608:47617),]
FairyTaleCorpus <- FairyTaleCorpus %>%
unnest_tokens(paragraph,text,token = "paragraphs", to_lower = FALSE) %>% #Unnest text into paragraphs
mutate(story_title = ifelse(str_detect(paragraph, regex("^[12\\,\\.\\[\\]A-Z \\'\\-]+$")), paragraph, NA),
story = na.locf(story_title)) %>% # Indicate the starting location of each story
filter(is.na(story_title)) %>% # Remove fairy tale titles from the text
select(-story_title) %>% # Remove the variable story_title
mutate(story_index = row_number())%>%
mutate(p_index = row_number())%>%
unnest_tokens(sentence,paragraph,token = "sentences", drop = FALSE, collapse=FALSE) %>% #Unnest text into sentences
mutate(s_index = row_number())%>%
unnest_tokens(word,sentence,token = "words", drop = FALSE, collapse=FALSE) %>% #Unnest text into words
mutate(class = ifelse(story %in% AnimalTales, "Animal Tales", # Index each story type
ifelse(story %in% TalesOfMagic, "Tales Of Magic",
ifelse(story %in% RealisticTales, "Realistic Tales",
ifelse(story %in% TalesOfTheStupidOgre, "Tales Of The Stupid Ogre",
ifelse(story %in% AnecdotesAndJokes, "Anecdotes And Jokes",
ifelse(story %in% FormulaTales, "Formula Tales","No Category")))))))
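The title-detection step can be illustrated on a few lines of text. The base R sketch below is a simplified stand-in for the pipeline above: it uses a shorter pattern than the project's regular expression, replaces `na.locf()` with a cumulative sum, and assumes the first line is always a title. The vector `lines` is hypothetical toy input.

```r
# Toy lines of text; fully capitalized lines are story titles
lines <- c("THE RABBIT'S BRIDE", "There was once a woman", "who lived with her daughter",
           "CLEVER GRETHEL", "There was once a cook")

# Treat a line as a title if it holds only capitals, spaces, and basic punctuation
is_title <- grepl("^[A-Z ,.'-]+$", lines, perl = TRUE)

# Carry each title forward to the lines that follow it
story <- lines[is_title][cumsum(is_title)]

# Keep only the body text, labelled with its story
corpus <- data.frame(story = story[!is_title], text = lines[!is_title],
                     stringsAsFactors = FALSE)
```

Because `cumsum(is_title)` increases by one at every title, indexing the titles by it repeats each title until the next one appears.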
### This code is used to fix the paragraph and sentence index so it restarts for each story.
p_ref = 0 # Initialize paragraph reference at 0
s_ref = 0 # Initialize sentence reference at 0
story_ref = 1
for (i in 2:nrow(FairyTaleCorpus)){ # For each observation...
if(FairyTaleCorpus$gutenberg_id[i]!=FairyTaleCorpus$gutenberg_id[i-1]){ # If you start a new author...
p_ref = 0 # Reset the references; the new-story check below then sets them
s_ref = 0
}
if(FairyTaleCorpus$story[i]!=FairyTaleCorpus$story[i-1]){ # If you start a new story...
story_ref = story_ref + 1
p_ref = FairyTaleCorpus$p_index[i]-1 # Mark the last paragraph index of the previous story
s_ref = FairyTaleCorpus$s_index[i]-1 # Mark the last sentence index of the previous story
}
FairyTaleCorpus$story_index[i]=story_ref # Assign the running story number
FairyTaleCorpus$p_index[i]=FairyTaleCorpus$p_index[i]-p_ref # Scale paragraph index for story
FairyTaleCorpus$s_index[i]=FairyTaleCorpus$s_index[i]-s_ref # Scale sentence index for story
}
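FairyTaleCorpus <- FairyTaleCorpus %>%
group_by(story, s_index) %>% # One group per sentence within each story, so repeated sentence text is not pooled
mutate(SentenceTotal = n())%>% # Number of words in the sentence
ungroup()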
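The per-story re-indexing performed by the loop can also be expressed without iterating over rows. The base R sketch below is one possible vectorized equivalent using `ave()`, shown on a hypothetical toy data frame `toy`; it is not the code used in the project.

```r
# Toy corpus: a global sentence index running across two stories, one row per word
toy <- data.frame(
  story   = c("A", "A", "A", "B", "B"),
  s_index = c(1, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Shift each story's sentence indices so that they start at 1
toy$s_index_in_story <- ave(toy$s_index, toy$story,
                            FUN = function(x) x - min(x) + 1)

# Count the words (rows) in each sentence of each story
toy$SentenceTotal <- ave(toy$s_index, paste(toy$story, toy$s_index), FUN = length)
```

On a corpus of roughly 100,000 word rows, a grouped operation of this kind avoids the row-by-row assignment into the data frame that the loop performs.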
With a completed corpus in tidy format, we move to conducting preliminary analysis on the data.
While preparing the data, fairy tales were classified according to their Aarne-Thompson-Uther index. Prior to using this information to track sentiment across topics, it is worth noting the number of fairy tales that fall under each category. The distribution of fairy tales into each of the 7 categories is summarized in the table below. The table includes both the overall distribution of tales for the corpus and the distribution of tales grouped by author.
\[\begin{array}{lr} \mbox{Topic} & \mbox{Number of Fairy Tales} \\ \mbox{Animal Tales} & 10 \\ \mbox{Tales of Magic} & 37 \\ \mbox{Religious Tales} & 0 \\ \mbox{Realistic Tales} & 2 \\ \mbox{Tales of the Stupid Ogre} & 1 \\ \mbox{Anecdotes and Jokes} & 9 \\ \mbox{Formula Tales} & 1 \\ \mbox{Not Assigned} & 3 \end{array}\]
Additionally, we can view the total number of words from the corpus contained within each category.
table(FairyTaleCorpus$class)
Anecdotes And Jokes Animal Tales Formula Tales
12450 7779 629
No Category Realistic Tales Tales Of Magic
2321 3017 78069
Tales Of The Stupid Ogre
1451
Both the number of fairy tales and the word count are unequal across topics. Therefore, it is possible that this sample bias introduces error when comparing sentiment across Aarne-Thompson-Uther topics. To mitigate some of the bias, sentiment scores by category will be standardized to account for the total number of words within each classification.
Word frequency is also useful for familiarization with the data in preliminary analysis. In this section word frequency is summarized by term frequency (tf), inverse document frequency (idf), and term frequency-inverse document frequency (tf-idf). Term frequency measures the frequency of a word within a document. This is scaled by \(\log_{10}\) to downweight the raw count so that it better measures the relevance of word frequency to the meaning of a document [4]. Thus, term frequency is calculated as
\[ \mbox{tf}(t,d) = \begin{cases} 1+\log_{10}\big(\mbox{count}(t,d)\big) & \mbox{if } \mbox{count}(t,d) > 0 \\ 0 & \mbox{otherwise} \end{cases} \]
Inverse document frequency is the logarithm of the ratio of the number of documents in a corpus, \(N\), to the number of documents in which a given word appears, \(n_{t}\). This measures how unique a word is to a particular document by reducing the weight for commonly used words and increasing the weight for words that are less frequent in a collection of documents [4]. The equation below shows how idf is calculated.
\[ \mbox{idf}(t,D) = \log\left(\frac{N}{n_{t}}\right) \]
Finally, combining tf and idf results in the term frequency-inverse document frequency, which measures “high frequency words that provide particularly important context to a single document within a group of documents” [9]. This is calculated as the product of tf and idf for a particular term (t) in a document (d) for a set of documents (D). The result is non-negative: it equals 0 for a term that appears in every document and grows as a term becomes more frequent in one document and rarer across the others.
\[ \mbox{tf-idf}(t,d,D) = \mbox{tf}(t,d) \cdot \mbox{idf}(t,D) \]
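To make the three formulas concrete, the base R sketch below evaluates them on a toy corpus. The list `docs` and the helpers `tf`, `idf`, and `tf_idf` are hypothetical and written only for this illustration. The idf equation leaves the base of the logarithm unspecified, so the natural logarithm is assumed here, and the helpers assume the term appears in at least one document.

```r
# Toy corpus: three short "documents" given as word vectors (hypothetical data)
docs <- list(
  d1 = c("wolf", "wolf", "goat", "king"),
  d2 = c("king", "queen", "king"),
  d3 = c("king", "fairy")
)

# Log-scaled term frequency: 1 + log10(count) if the term occurs, else 0
tf <- function(term, doc) {
  n <- sum(doc == term)
  if (n > 0) 1 + log10(n) else 0
}

# Inverse document frequency: log(N / n_t), natural log assumed
idf <- function(term, docs) {
  n_t <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  log(length(docs) / n_t)
}

tf_idf <- function(term, doc, docs) tf(term, doc) * idf(term, docs)

wolf_score <- tf_idf("wolf", docs$d1, docs)   # occurs twice, only in d1
king_score <- tf_idf("king", docs$d1, docs)   # occurs in every document
```

Here the term found in every document receives a score of 0 because its idf is log(1), while the term concentrated in a single document receives a positive score.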
The following code calculates tf, idf, and tf-idf for the words within the corpus and visualizes the results using the ggplot2 package. Word frequency is visualized across authors, topics, and fairy tales, in accordance with the three levels of research questions that this project addresses.
The tf plots between topics continue to confirm that the tidytext package did an adequate job of removing stop words. Additionally, we see instances of terms that intuitively align with topics. For instance, in the Animal Tales category we see a high frequency of animals. Nevertheless, other categories overlap, also showing a high term frequency of animals. Several words that were deemed most frequent by tf also had the highest tf-idf scores. It is also interesting to note that the term ogre is not frequent in the Tales of the Stupid Ogre category even though it appeared frequently when calculating word frequency across authors.
The following code assigns word sentiment to text using the BING lexicon. The sentiment by topic is quantified as the proportion of positive and negative words which appear to the total number of words within each topic. The proportion of sentiment by topic is plotted using the ggplot2 package.
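The proportion calculation described above might look like the following sketch. The later code refers to `topic_total_words` and `bing_word_counts_byTopic`, whose definitions are not shown, so the version of `topic_total_words` here is an assumption about its shape (one row per topic with a `total` column). The inputs `toy_corpus` and `toy_bing` are hypothetical stand-ins for `FairyTaleCorpus` and `get_sentiments("bing")`.

```r
library(dplyr)

# Toy stand-ins (hypothetical): a tidy corpus and a two-class lexicon
toy_corpus <- data.frame(
  class = c("Animal Tales", "Animal Tales", "Animal Tales",
            "Tales Of Magic", "Tales Of Magic"),
  word  = c("wolf", "happy", "dead", "happy", "princess"),
  stringsAsFactors = FALSE
)
toy_bing <- data.frame(word = c("happy", "dead"),
                       sentiment = c("positive", "negative"),
                       stringsAsFactors = FALSE)

# Total words per topic, used as the denominator (assumed structure)
topic_total_words <- toy_corpus %>% count(class, name = "total")

# Proportion of positive and negative words within each topic
bing_by_topic <- toy_corpus %>%
  inner_join(toy_bing, by = "word") %>%
  count(class, sentiment) %>%
  left_join(topic_total_words, by = "class") %>%
  mutate(proportion = n / total)
```

Dividing by the total number of words in the topic, rather than by the number of matched words, keeps topics of very different sizes comparable.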
From the graph it appears that, using the BING lexicon, the Formula Tales category is proportionally the most negative and least positive. Animal Tales also has a higher proportion of negative sentiment than positive sentiment. Realistic Tales, Tales Of Magic, Anecdotes And Jokes, and Tales of the Stupid Ogre all appear to have a higher proportion of positive sentiment than negative sentiment. The Tales of Magic, Tales of the Stupid Ogre, and Anecdotes and Jokes categories have similarly distributed positive and negative sentiment proportions.
We can also view how each word contributed to the positive and negative sentiment score across topics. The following code generates a plot of the 10 words that contributed most to positive and negative sentiment. From the plot it appears that the BING sentiment lexicon does fairly well at quantifying sentiment across topics. The words that contribute to positive and negative sentiment are sound, and there are no blatant issues.
bing_word_counts_byTopic %>%
group_by(sentiment,class) %>%
top_n(10, n) %>% # Rank by word count; without wt, top_n() ranks by the last column
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n/total, fill = sentiment)) +
geom_col(show.legend = TRUE) +
facet_wrap(~class, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
The following code assigns word sentiment to text using the AFINN lexicon. The sentiment by topic is quantified as the proportion of words in each factor of ranked sentiment, from -5 (negative) to 5 (positive), with respect to the total number of words within each topic. The proportion of sentiment by topic is plotted using the ggplot2 package.
afinn_word_counts_byTopic <- FairyTaleCorpus %>%
filter(!class %in% "No Category") %>%
inner_join(get_sentiments("afinn"))%>%
group_by(class) %>%
count(word, score, sort = TRUE) %>%
ungroup()%>%
left_join(topic_total_words)%>%
mutate(score_proportion = n/total)
afinn_word_counts_byTopic %>%
group_by(class, score)%>%
summarise(score_proportion = sum(score_proportion)) %>%
ggplot(aes(score, score_proportion, fill = class)) +
geom_col(show.legend = FALSE) +
facet_wrap(~class, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
From the graph it appears that, using the AFINN lexicon, the Formula Tales category has a larger proportion of strongly negative sentiment than the other categories. Using the word contribution to sentiment scores, we can see why this occurs.
afinn_word_counts_byTopic %>%
group_by(score,class) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n/total, fill = class)) +
geom_col(show.legend = TRUE) +
facet_wrap(~score, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
It appears that words like ass and cock, which refer to a donkey and a rooster respectively, are not recognized by the AFINN lexicon as characters, so these words are ranked as highly negative and bias the results. Removing these wrongly scored words provides a more accurate representation of sentiment scores using the AFINN lexicon. The code below excludes them, along with the word dear, before joining the lexicon.
afinn_word_counts_byTopic <- FairyTaleCorpus %>%
filter(!class %in% "No Category") %>%
filter(!word %in% c("cock", "ass", "dear"))%>%
inner_join(get_sentiments("afinn"))%>%
group_by(class) %>%
count(word, score, sort = TRUE) %>%
ungroup()%>%
left_join(topic_total_words)%>%
mutate(score_proportion = n/total)
afinn_word_counts_byTopic %>%
group_by(class, score)%>%
summarise(score_proportion = sum(score_proportion)) %>%
ggplot(aes(score, score_proportion, fill = class)) +
geom_col(show.legend = FALSE) +
facet_wrap(~class, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
While it is nice that the AFINN lexicon attempts to quantify the positivity or negativity of words beyond binary classification, it is likely that the scoring of sentiment is highly subjective and could bias results for words that appear in a different context than the one for which the lexicon was developed. Thus, the second and third research questions will proceed using the BING lexicon rather than the AFINN lexicon.
The following code assigns word sentiment to text using the NRC lexicon. Each word is classified into one or more of the categories anticipation, joy, positive, surprise, trust, sadness, negative, disgust, fear, and anger. To compare the written sentiment across topics, the proportion of words by topic within each sentiment category is calculated.
nrc_word_counts_byTopic <- FairyTaleCorpus %>%
filter(!class %in% "No Category") %>%
filter(!word %in% c("cock","ass","dear")) %>%
inner_join(get_sentiments("nrc"))%>%
group_by(class) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()%>%
mutate(sentiment = reorder(sentiment, n)) %>%
left_join(topic_total_words)
nrc_word_counts_byTopic %>%
ggplot(aes(sentiment, n/total, fill=class))+
geom_col(show.legend = FALSE)+
facet_wrap(~class, ncol = 3, scales = "free")+
coord_flip()
From the plot above we see different proportions for expressed sentiment across topics but similar trends. For all topics the NRC lexicon finds a higher proportion of positive than negative words. Additionally, it appears that the Animal Tales category has the highest proportion of negative sentiment. The following code generates a plot to see which words contribute to the proportions of each sentiment by category.
nrc_word_counts_byTopic %>%
group_by(sentiment,class) %>%
top_n(10, n) %>% # Rank by word count; without wt, top_n() ranks by the last column (total)
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n/total, fill = sentiment)) +
geom_col(show.legend = TRUE) +
facet_wrap(~class, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
From the graph it appears that a single word can be assigned to multiple sentiment categories. Since the sentiment of a word is likely context dependent, it is possible that further subdivision of word sentiment beyond positive and negative also introduces more bias. Nevertheless, it is nice that the NRC lexicon attempts to categorize word sentiment beyond a binary classification. Thus, if this lexicon is used to categorize sentiment, extra caution should be taken to limit the introduction of additional subjectivity and bias.
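The effect of multiple categories per word can be seen in a small join. In the base R sketch below, `toy_nrc` is a hypothetical lexicon whose category assignments are invented for illustration and are not taken from the real NRC word list.

```r
# Hypothetical lexicon in which one word carries several categories
toy_nrc <- data.frame(
  word      = c("mother", "mother", "mother", "wolf"),
  sentiment = c("positive", "joy", "trust", "fear"),
  stringsAsFactors = FALSE
)

# Three tokens from a toy text; "house" is not in the lexicon
tokens <- data.frame(word = c("mother", "wolf", "house"), stringsAsFactors = FALSE)

# Inner join: each token is repeated once per category it belongs to
joined <- merge(tokens, toy_nrc, by = "word")
rows_per_word <- table(joined$word)
```

Two matched tokens produce four rows, so a word with many categories contributes to several proportions at once, which is one reason to read per-category totals with care.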
story_sentiment_bing <- FairyTaleCorpus %>%
filter(story %in% c(" LITTLE RED RIDING-HOOD."," LITTLE RED CAP",
"CINDERELLA, OR THE LITTLE GLASS SLIPPER.", " ASCHENPUTTEL",
" THE SLEEPING BEAUTY IN THE WOODS.", " THE SLEEPING BEAUTY")) %>%
inner_join(get_sentiments("bing"))%>%
count(story, index = s_index, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive-negative)
ggplot(data = story_sentiment_bing, mapping = aes(x = index, y = sentiment, fill = story)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(facets = ~ story, ncol = 1, scales = "free_x")
story_sentiment_afinn <- FairyTaleCorpus %>%
filter(story %in% c(" LITTLE RED RIDING-HOOD."," LITTLE RED CAP",
"CINDERELLA, OR THE LITTLE GLASS SLIPPER.", " ASCHENPUTTEL",
" THE SLEEPING BEAUTY IN THE WOODS.", " THE SLEEPING BEAUTY")) %>%
inner_join(get_sentiments("afinn"))%>%
count(story, index = s_index, score)
story_sentiment_afinn <- story_sentiment_afinn %>%
group_by(story,index)%>%
summarize(tot_score = sum(score * n))%>% # Weight each score by the number of words that received it
ungroup()
ggplot(data = story_sentiment_afinn, mapping = aes(x = index, y = tot_score, fill = story)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(facets = ~ story, ncol = 1, scales = "free_x")